Computer-implemented method for correcting transmission errors using linear programming

ABSTRACT

A computer-implemented method for correcting transmission errors. According to the method, a transmitted vector corrupted by error can be recovered by solving a linear program. The method has applications in the transmission of Internet media, Internet telephony, and speech transmission. In addition, error correction is embedded as a key building block in numerous algorithms and data structures where corruption is possible; corruption of digital data stored on a hard drive, CD, DVD or similar media is a good example. In short, progress in error correction has the potential to impact several storage and communication systems.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of U.S. provisional Patent Application Ser. No. 60/652,992, filed Feb. 14, 2005 for "Error Correction Using Linear Programming" by Emmanuel Candes and Terence Tao, the disclosure of which is incorporated herein by reference.

FEDERAL SUPPORT

This invention was made with U.S. Government support under grant No. DMS01-40698 (FRG) awarded by the National Science Foundation. The U.S. Government has certain rights in this invention.

BACKGROUND

1. Field

The present disclosure relates to a method for correcting errors. In particular, it relates to a computer-implemented method for correcting transmission errors in a physical system.

2. Related Art

Finding sparse solutions to underdetermined systems of linear equations—a problem of great importance in signal processing and in the design of robust communication channels—is in general NP-hard [9,21]. For example, the sparsest solution is given by

$(P_0) \quad \min_{d \in R^m} \|d\|_{l_0} \quad \text{subject to} \quad Fd = \tilde{y} \ (= Fe),$

and solving this problem essentially requires exhaustive searches over all subsets of columns of F, a procedure which clearly is combinatorial in nature and has exponential complexity.
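To make the combinatorial nature of (P₀) concrete, the following brute-force sketch (in Python with NumPy; the helper name solve_p0 is ours and not part of the disclosure) enumerates supports of increasing size and is workable only for very small m:

```python
import numpy as np
from itertools import combinations

def solve_p0(F, y_tilde, tol=1e-9):
    """Brute-force (P0): find the sparsest d with F d = y_tilde by
    scanning all supports of size k = 0, 1, 2, ... (exponential in m)."""
    p, m = F.shape
    if np.linalg.norm(y_tilde) <= tol:        # d = 0 is the sparsest solution
        return np.zeros(m)
    for k in range(1, m + 1):
        for T in combinations(range(m), k):
            idx = list(T)
            c, *_ = np.linalg.lstsq(F[:, idx], y_tilde, rcond=None)
            if np.linalg.norm(F[:, idx] @ c - y_tilde) <= tol:
                d = np.zeros(m)
                d[idx] = c
                return d
    return None
```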

This computational intractability has recently led researchers to develop alternatives to (P₀), and a frequently discussed approach considers a similar program in the l₁-norm which goes by the name of Basis Pursuit [6]:

$(P_1) \quad \min_{d \in R^m} \|d\|_{l_1} \quad \text{subject to} \quad Fd = \tilde{y},$

where we recall that $\|d\|_{l_1} = \sum_{i=1}^{m} |d_i|$. Unlike the l₀-norm which enumerates the nonzero coordinates, the l₁-norm is convex. It is also well-known [7] that (P₁) can be recast as a linear program (LP).

Motivated by the problem of finding sparse decompositions of special signals in the field of mathematical signal processing, and following upon the groundbreaking work of Donoho and Huo [11], a series of beautiful articles [16, 12, 13, 14, 24] showed exact equivalence between the two programs (P₀) and (P₁). In a nutshell, it has been shown that for m/2 by m matrices F obtained by concatenation of two orthonormal bases, the solutions to both (P₀) and (P₁) are unique and identical provided that, in the most favorable case, the vector e has at most $0.914\sqrt{m/2}$ nonzero entries. This is of little practical use here since we are interested in procedures that might recover a signal when a constant fraction of the output is unreliable.

Using very different ideas and together with Romberg [3], the applicants proved that the equivalence holds with overwhelming probability for various types of random matrices provided that the number of nonzero entries in the vector e be of the order of m/log m [4, 5]. In the special case where F is an m/2 by m random matrix with independent standard normal entries, [9] proved that the number of nonzero entries may be as large as ρ·m, where ρ>0 is some very small and unspecified positive constant independent of m.

SUMMARY

The present application considers the model problem of recovering an input vector f∈R^n from corrupted measurements y=Af+e. Here, A is an m by n matrix (we will assume throughout that m>n), and e is an arbitrary and unknown vector of errors. The problem considered by the applicants is whether it is possible to recover f exactly from the data y. And if so, how?

In its abstract form, this problem is of course equivalent to the classical error correcting problem which arises in coding theory, as A may be thought of as a linear code; a linear code is a given collection of codewords which are vectors a₁, . . . , a_n ∈ R^m—the columns of the matrix A. Given a vector f∈R^n (the plaintext), a vector Af in R^m (the ciphertext) can be generated; we refer to the transformation from f to Af as the encoding scheme. If A has full rank, then one can clearly recover the plaintext f from the ciphertext Af by standard linear algebra methods. But now we suppose that the ciphertext Af is corrupted by an arbitrary vector e∈R^m, giving rise to the corrupted ciphertext Af+e. The question is then: given the coding matrix A and Af+e, can one recover f exactly?

As is well-known, if the fraction of the corrupted entries is too large, then of course there is no hope of reconstructing f from Af+e; for instance, assume that m=2n and consider two distinct plaintexts f, f′ and form a vector g∈R^m by setting half of its m coefficients equal to those of Af and half equal to those of Af′. Then g=Af+e=Af′+e′ where both e and e′ are supported on sets of size at most n=m/2. This simple example shows that accurate decoding is impossible when the size of the support of the error vector is greater than or equal to half the size of the output Af. Therefore, a common assumption in the literature is that only a small fraction of the entries are actually damaged:

$\|e\|_{l_0} := |\{i : e_i \neq 0\}| \leq \rho \cdot m.$

For which values of ρ can we hope to reconstruct e with practical algorithms? That is, with algorithms whose complexity is at most polynomial in the length m of the code A? We refer to such algorithms as decoding schemes.

To decode f, note that it is obviously sufficient to reconstruct the vector e, since knowledge of Af+e together with e gives Af, and consequently f since A has full rank. The decoding scheme of the present application is then as follows. The applicants construct a matrix F which annihilates the m by n matrix A on the left, i.e. such that FA=0. This can be done by taking a matrix F whose kernel is the range of A in R^m, which is an n-dimensional subspace (e.g. F could be the orthogonal projection onto the cokernel of A). Then apply F to the output y=Af+e to obtain

$\tilde{y} = F(Af + e) = Fe,$

since FA=0. Therefore, the decoding problem is reduced to that of reconstructing a sparse vector e from the observations Fe (by sparse, the applicants mean that only a fraction of the entries of e are nonzero).
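As a concrete illustration of this reduction (a minimal sketch assuming NumPy/SciPy; the function name is ours), an annihilating matrix F can be obtained from an orthonormal basis of the orthogonal complement of the range of A:

```python
import numpy as np
from scipy.linalg import null_space

def build_annihilator(A):
    """Return F with F A = 0: null_space(A.T) is an orthonormal basis of
    ker(A^T), i.e. of the orthogonal complement of range(A) in R^m, so
    its transpose is an (m - n) by m matrix annihilating A on the left."""
    return null_space(A.T).T

# Applying F to y = A f + e removes the contribution of f entirely:
# F @ y equals F @ e, so decoding reduces to recovering the sparse e.
```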

The present application thus discloses a novel method that enables decoding of a transmitted vector of data in the presence of corruption of the original values.

According to a first aspect, a computer-implemented method for correcting transmission errors is provided, comprising the steps of: transmitting a vector using a linear code, said vector becoming corrupted by error upon transmission; and recovering the vector by solving a linear program in the form of a convex optimization problem

$\min_{x \in R^{n}} \|y - Ax\|_{l_1},$

wherein y is the corrupted vector, A is a coding matrix expressing the linear code, and l₁ is a norm such that

$\|x\|_{l_1} := \sum_{i} |x_i|.$

According to a second aspect, a computer system for simulating a physical process is disclosed, comprising: means for storing in memory a received vector y, said received vector being corrupted by error; means for storing in memory a linear code A used to transmit said received vector; means for storing in memory a solution vector x; and means for performing operations on A, x and y, the operations comprising performing

$\min_{x \in R^{n}} \|y - Ax\|_{l_1},$

wherein l₁ represents a norm such that

$\|x\|_{l_1} := \sum_{i} |x_i|.$

According to a third aspect, a method for transmitting packets over a communication network is disclosed, comprising: encoding the packets using a coding matrix; transmitting the encoded packets; and decoding the transmitted encoded packets by solving a linear program

$\min_{x \in R^{n}} \|y - Ax\|_{l_1},$

wherein A is the coding matrix, y is a vector expressing the transmitted encoded packets, and l₁ represents a norm such that

$\|x\|_{l_1} := \sum_{i} |x_i|.$

The present application is relevant in the design of practical communication networks (especially those of a digital nature) over noisy channels, as for instance arises in Internet, telephony, or electromagnetic communications. For instance, if a data vector is encoded using a random matrix A (thus expanding the length of the data, say by a factor of 4) and then each component of the ciphertext thus obtained is sent through the Internet as a separate packet, one can tolerate a fixed fraction (say 30%) of the packets being lost, corrupted, or delayed, and still be able to reconstruct the original data exactly. Of course, the matrix A has to be known for data recovery to work, but A can be generated by both transmitter and receiver through a standard random number generator with some previously agreed upon seed, as sketched below. It should be remarked that even if an adversary has the ability to choose which packets to destroy or corrupt, and what data to replace them with, this does not affect the ability to reconstruct the data so long as the adversary is only able to affect a fixed fraction of the packets. Also, as the protocol is non-adaptive, there is no need to query the transmitter at any stage in order to perform recovery; in particular, the data can be collected passively and decoded at some convenient later date. Note that because one is likely to recover the data in perfect condition, one can safely pre-process the data (e.g. by compressing it) even if such pre-processing is very sensitive to the change of even a single bit of the data.
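By way of illustration (a sketch only; the seed value and dimensions are arbitrary), transmitter and receiver can derive the same Gaussian coding matrix from a previously agreed upon seed, so that A itself never travels over the channel:

```python
import numpy as np

def coding_matrix(seed, m, n):
    """Derive an m-by-n Gaussian coding matrix from a shared seed."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((m, n))

# Transmitter and receiver agree on the seed beforehand (here 1234) and
# on a 4x expansion of the data length, as in the example above.
A_tx = coding_matrix(seed=1234, m=4 * 256, n=256)
A_rx = coding_matrix(seed=1234, m=4 * 256, n=256)
assert np.array_equal(A_tx, A_rx)
```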

The present application also has relevance to robust data storage and recovery protocols. Traditional error-correcting storage protocols, such as those employed on modern CDs and DVDs, allow one to tolerate small amounts of damage (such as scratches), and even large damage will corrupt only a fraction of the data. The algorithm presented here can achieve a similar performance (by dividing the data into blocks and applying the algorithm on each block separately), but can also scale to increase the robustness of the data, at the cost of speed of recovery. In the most extreme case, when the entire contents of a DVD are encoded as a single unit (thus reducing the capacity of the disk by a factor of two to four, though this can be compensated for somewhat by first compressing the data), the data can then be recovered perfectly as long as no more than 10%-30% of the disk is damaged, though recovery in this case may take a significant amount of time (e.g. months) due to the need to perform an immense amount of linear algebra. Again, the location of the damage is not important; deleting a consecutive sector of 30% of the disk can be handled just as easily as damage to 30% of the disk that is distributed randomly. Also, as before, one can safely pre-process the data in any desired manner.

Both the encoding and decoding aspects of the algorithm in accordance with the present disclosure can be easily and efficiently implemented in any computer architecture. The encoding scheme, being the application of a single linear map f ↦ Af, is extremely fast and can even be performed by very simple, low-power devices at high speeds. One can for instance imagine a "smart dust" application consisting of a large number of cheap devices measuring some data (represented mathematically as a vector f), with each device recording and then transmitting a single component of Af; thus the sole purpose of each device is to perform and then transmit a single random linear measurement (together with some identifying code to signify which component of Af is being transmitted). Even if a significant fraction of these devices fail, are jammed, or are compromised to transmit false data, the algorithm will still be able to recover f perfectly. The decoding scheme, requiring the solution of a linear program, does require more computational power, but standard library packages can easily implement this program in an efficient manner, and for data of small size (e.g. vectors of at most 1024 entries in length) the decoding is essentially instantaneous. It is conceivable that in some cases where the data is of large size (e.g. several million entries), one could take advantage of parallel processing techniques to speed up the decoding.

The democratic nature of LP codes makes them particularly well-suited for distributing data to multiple recipients. Suppose, for example, that we wish to distribute a large piece of data from a satellite to many different users. The LP code for this data can be divided into small packets, which are in turn broadcast to the users. The quality of the channel between the satellite and each user could be very different, and could change over time, meaning that each user will successfully receive a different subset of the packets. Their ability to decode the data, however, depends only on the number of packets received, and not at all on which packets are received. Successful transmission is possible even though there is no feedback from any of the users about which packets are missing. Another possible application is robust distributed storage. Suppose a large file is LP coded, divided into packets, and then distributed over an array of disks. If some of the disks fail, the data can still be retrieved from the subset of packets still available.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1, used to illustrate Lemma 8, shows the behaviour of the upper bound ρ_{p/m}(r) for three values of the ratio p/m, namely p/m=¾, ⅔, ½.

FIGS. 2(a) and 2(b) show l₁ recovery of an input signal from y=Af+e with A an m by n matrix with independent Gaussian entries. In this experiment, we 'over sample' the input signal by a factor of 2 so that m=2n. (a) Success rate of (P₁) for m=512. (b) Success rate of (P₁) for m=1024. Observe the similar pattern and cut-off point. In these experiments, exact recovery occurs as long as about 17% or less of the entries are corrupted.

FIG. 3 shows l₁ recovery of an input signal from y=Af+e with A an m by n matrix with independent Gaussian entries. In this experiment, the applicants 'over sample' the input signal by a factor of 4, thus m=4n. In these experiments, exact recovery occurs as long as about 34% or less of the entries are corrupted.

FIG. 4 describes abstractly how the encoding and decoding scheme could be implemented in practice, starting with arbitrary data or data packets f (10) and then obtaining decoded data or data packets f# (20) which will equal the original data f if the fraction of transmitted data that has been corrupted is below a certain threshold. For instance, the transmission could be done over the Internet (30) (with errors coming from packet loss or malicious corruption of data), with one computer (40) performing the encoding operation f ↦ Af and another computer (50) performing the decoding operation (P₁) or (P₁′). Alternatively, the transmission could be via physical data storage (30) (to a hard disk, CD or DVD, with the error now arising from physical damage to the storage device), with encoding and decoding again performed by computer. A third possibility is with the encoding performed by a large number of cheap (and not necessarily reliable) low-power electronic measurement devices, the transmission conducted over open air using some electromagnetic frequency (and thus subject to interference or attenuation), and the data then collected by some receiver and decoded by a computer.

FIG. 5 shows a possible example of a computer (100) embodying the method in accordance with the present disclosure. The computer (100) comprises a memory (110) and a processor (120). The memory (110) can store a received vector corrupted by error, a linear code used to transmit the received vector, and a solution vector. The processor (120) can perform operations on the stored variables by way of the method described herein.

DETAILED DESCRIPTION

The present application introduces the concept of a restrictedly almost orthonormal system—a collection of vectors which behaves like an almost orthonormal system, but only for sparse linear combinations. Thinking about these vectors as the columns of the matrix F, the applicants show that this condition allows for the exact reconstruction of sparse linear combinations of these vectors, i.e. e. The applicants' results are significantly different from those mentioned above as they are deterministic and do not involve any kind of randomization, although they can of course be specialized to random matrices. For instance, it shall be shown that a Gaussian matrix with independent entries sampled from the standard normal distribution is restrictedly almost orthonormal with overwhelming probability, and that minimizing the l₁-norm recovers sparse decompositions with a number of nonzero entries of size ρ₀·m; the applicants shall actually give numerical values for ρ₀.

The applicants presented the connection with sparse solutions to underdetermined systems of linear equations merely for pedagogical reasons. There is a more direct approach. To decode f from corrupted data y=Af+e, we consider solving the following l₁-minimization problem:

$(P_1') \quad \min_{g \in R^{n}} \|y - Ag\|_{l_1}.$

Now f is the unique solution of (P₁′) if and only if e is the unique solution of (P₁). In other words, (P₁) and (P₁′) are equivalent programs. To see why this is true, observe on the one hand that, since y=Af+e, we may decompose g as g=f+h so that

$(P_1') \Leftrightarrow \min_{h \in R^{n}} \|e - Ah\|_{l_1}.$

On the other hand, the constraint Fx=Fe means that x=e−Ah for some h∈R^n and, therefore,

$(P_1) \Leftrightarrow \min_{h \in R^{n}} \|x\|_{l_1}, \; x = e - Ah \Leftrightarrow \min_{h \in R^{n}} \|e - Ah\|_{l_1},$

which proves the claim.

The decoding procedure (P₁′) may also be re-expressed as a Linear Program (LP)—hence the title of the present application. Indeed, the l₁-minimization problem is equivalent to

$\min \mathbf{1}^T t, \quad -t \leq y - Ag \leq t,$   (1)

where the optimization variables are t∈R^m and g∈R^n (as is standard, the generalized vector inequality x≦y means that x_i≦y_i for all i). As a result, the decoding procedure (P₁′) is an LP with inequality constraints and can be solved efficiently using standard optimization algorithms, see [2].
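A minimal sketch of this LP (in Python, assuming SciPy's linprog; the function and variable names are ours) stacks the decision variables as [g; t] and encodes −t ≦ y−Ag ≦ t as two blocks of inequality constraints:

```python
import numpy as np
from scipy.optimize import linprog

def l1_decode(A, y):
    """Solve (P1'): min_g ||y - A g||_{l1} via the LP (1).

    Decision variables z = [g; t] with g in R^n and t in R^m; minimize
    1^T t subject to A g - t <= y and -A g - t <= -y."""
    m, n = A.shape
    c = np.concatenate([np.zeros(n), np.ones(m)])     # objective: 1^T t
    A_ub = np.block([[A, -np.eye(m)],
                     [-A, -np.eye(m)]])
    b_ub = np.concatenate([y, -y])
    bounds = [(None, None)] * n + [(0, None)] * m     # g free, t >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:n]
```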

Restricted Isometries

In the remainder of the present application, it will be convenient to use some linear algebra notations. We denote by (v_j)_{j∈J} ∈ R^p the columns of the matrix F and by H the Hilbert space spanned by these vectors. Further, for any T ⊆ J, we let F_T be the submatrix with column indices j∈T so that

$F_T c = \sum_{j \in T} c_j v_j \in H.$

To introduce the notion of an almost orthonormal system, the applicants first observe that if the columns of F are sufficiently "degenerate", the recovery problem cannot be solved. In particular, if there exists a non-trivial sparse linear combination

$\sum_{j \in T} c_j v_j = 0$

of the v_j which sums to zero, and T=T₁∪T₂ is any partition of T into two disjoint sets, then the vector

$y := \sum_{j \in T_1} c_j v_j = \sum_{j \in T_2} (-c_j) v_j$

has two distinct sparse representations. On the other hand, linear dependencies

$\sum_{j \in J} c_j v_j = 0$

which involve a large number of nonzero coefficients c_j, as opposed to a sparse set of coefficients, do not present an obvious obstruction to sparse recovery. At the other extreme, if the (v_j)_{j∈J} are an orthonormal system, then the recovery problem is easily solved by setting c_j := ⟨f, v_j⟩.

In the present application, if a "restricted orthonormality hypothesis", which is far weaker than assuming orthonormality, is imposed, then (P₁) solves the recovery problem, even if the (v_j)_{j∈J} are highly linearly dependent (for instance, it is possible for m:=|J| to be much larger than the dimension of the span of the v_j's). To make this quantitative, the following definition is introduced.

-   Definition. (Restricted isometry constants) Let F be the matrix with the finite collection of vectors (v_j)_{j∈J} ∈ R^p as columns. For every integer 1≦S≦|J|, we define the S-restricted isometry constant δ_S to be the smallest quantity such that F_T obeys

$(1-\delta_S)\|c\|^2 \leq \|F_T c\|^2 \leq (1+\delta_S)\|c\|^2$   (2)

for all subsets T⊂J of cardinality at most S, and all real coefficients (c_j)_{j∈T}. Similarly, we define the (S,S′)-restricted orthogonality constants θ_{S,S′} for S+S′≦|J| to be the smallest quantity such that

$|\langle F_T c, F_{T'} c' \rangle| \leq \theta_{S,S'} \cdot \|c\| \, \|c'\|$   (3)

holds for all disjoint sets T, T′ ⊆ J of cardinalities |T|≦S and |T′|≦S′.

The numbers δ_S and θ_{S,S′} measure how close the vectors v_j are to behaving like an orthonormal system, but only when restricting attention to sparse linear combinations involving no more than S vectors. These numbers are clearly non-decreasing in S, S′. For S=1, the value δ₁ only conveys magnitude information about the vectors v_j; indeed δ₁ is the best constant such that

$1-\delta_1 \leq \|v_j\|_H^2 \leq 1+\delta_1 \quad \text{for all } j \in J.$

In particular, δ₁=0 if and only if all of the v_j have unit length. We will later show that the higher δ_S control the orthogonality numbers θ_{S,S′}:

-   Lemma 1. We have θ_{S,S′} ≦ δ_{S+S′} ≦ θ_{S,S′} + max(δ_S, δ_{S′}) for all S, S′.

To see the relevance of the restricted isometry numbers δ_S to the sparse recovery problem, consider the following simple observation:

-   Lemma 2. Suppose that S≧1 is such that δ_{2S}<1, and let T⊂J be such that |T|≦S. Let f:=F_T c for some arbitrary |T|-dimensional vector c. Then the set T and the coefficients (c_j)_{j∈T} can be reconstructed uniquely from knowledge of the vector f and the v_j's.

-   Proof. We prove that there is a unique c with $\|c\|_{l_0} \leq S$ obeying

$f = \sum_{j} c_j v_j.$

Suppose for contradiction that f had two distinct sparse representations f=F_T c=F_{T′} c′ where |T|, |T′|≦S. Then

$F_{T \cup T'} d = 0, \qquad d_j := c_j 1_{j \in T} - c'_{j} 1_{j \in T'}.$

Taking norms of both sides and applying (2) and the hypothesis δ_{2S}<1, we conclude that ∥d∥²=0, contradicting the hypothesis that the two representations were distinct.
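For small instances, the constant δ_S of the definition above can be evaluated directly (a brute-force sketch assuming NumPy; the cost is exponential in S, which is exactly the intractability discussed earlier). By the eigenvalue monotonicity noted in a later section, it suffices to scan supports of size exactly S:

```python
import numpy as np
from itertools import combinations

def delta_S(F, S):
    """Smallest delta with (1 - delta) ||c||^2 <= ||F_T c||^2 <= (1 + delta) ||c||^2
    over all |T| <= S: the worst deviation of the squared singular values
    of the submatrices F_T from 1, over all supports T of size S."""
    m = F.shape[1]
    worst = 0.0
    for T in combinations(range(m), S):
        sv = np.linalg.svd(F[:, list(T)], compute_uv=False)
        worst = max(worst, sv[0] ** 2 - 1.0, 1.0 - sv[-1] ** 2)
    return worst
```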

Main Results

The previous lemma is an abstract existence argument which shows what might theoretically be possible, but does not supply any efficient algorithm to recover T and c_j from f and the v_j other than by brute force search—as discussed earlier. In contrast, the main result is that, by imposing slightly stronger conditions on the restricted isometry constants, the l₁-minimization program (P₁) decodes f exactly.

-   Theorem 3. Suppose that S≧1 is such that

$\delta_S + \theta_S + \theta_{S,2S} < 1,$   (4)

where θ_S is shorthand for θ_{S,S}, and let c be a real vector supported on a set T⊂J obeying |T|≦S. Put f:=Fc. Then c is the unique minimizer to

$(P_1) \quad \min \|d\|_{l_1} \quad \text{subject to} \quad Fd = f.$

Note from Lemma 1 that (4) implies δ_{2S}<1, and is in turn implied by δ_S+δ_{2S}+δ_{3S}<1. Thus the condition (4) is roughly "three times as strict" as the condition required for Lemma 2.

Theorem 3 is inspired by the applicants' previous work [4], see also [3, 5], but unlike those earlier results, the applicants' results here are deterministic, and thus do not have a non-zero probability of failure, provided of course one can ensure that the v_j verify the condition (4). By virtue of the previous discussion, we have the companion result:

-   Theorem 4. Suppose F is such that FA=0 and let S≧1 be a number obeying the hypothesis of Theorem 3. Set y=Af+e, where e is a real vector supported on a set of size at most S. Then f is the unique minimizer to

$(P_1') \quad \min_{g \in R^{n}} \|y - Ag\|_{l_1}.$

Gaussian Random Matrices

An important question is to find matrices with good restricted isometry constants, i.e. such that (4) holds for large values of S. Indeed, such matrices will tolerate a larger fraction of output in error while still allowing exact recovery of the original input by linear programming. How to construct such matrices might be delicate. In a later section, however, the applicants will argue that generic matrices, namely samples from the Gaussian ensemble, obey (4) for relatively large values of S.

-   Theorem 5. Assume p≦m and let F be a p by m matrix whose entries are i.i.d. Gaussian with mean zero and variance 1/p. Then the condition of Theorem 3 holds with overwhelming probability provided that r=S/m is small enough so that

$r < r^*(p,m),$

where r*(p,m) is given in (12). (By "with overwhelming probability", we mean with probability decaying exponentially in m.) In the limit of large samples, r*(p,m) only depends upon the ratio p/m, and numerical evaluations show that the condition holds for r≦3.6×10⁻⁴ in the case where p/m=¾, r≦3.2×10⁻⁴ when p/m=⅔, and r≦2.3×10⁻⁴ when p/m=½.

In other words, Gaussian matrices are a class of matrices for which one can solve underdetermined systems of linear equations by minimizing l₁ provided, of course, the input vector has fewer than ρ·m nonzero entries with ρ>0. The applicants mentioned earlier that this result is similar to [9]. What is new and non-obvious here is that by using a very different machinery, one can obtain explicit numerical values which were not available before.

In the context of error correction, the consequence is that a fraction of the output may be corrupted by arbitrary errors and yet, solving a convex problem would still recover f exactly—a rather unexpected feat.

-   Corollary. Suppose A is an m by n Gaussian matrix and set p=m−n. Under the hypotheses of Theorem 5, the solution to (P₁′) is unique and equal to f.

This is an immediate consequence of Theorem 5. The only thing we need to argue is why we may think of the annihilator F (such that FA=0) as a matrix with independent Gaussian entries. Observe that the range of A is a random space of dimension n embedded in R^m, so that the data ỹ=Fe is the projection of e on a random space of dimension p. The range of a p by m matrix with independent Gaussian entries is precisely a random subspace of dimension p, which justifies the claim.

The applicants would like to point out that the theoretical bounds derived in the present application are overly conservative; numerical experiments have shown that the error tolerance that the algorithm can handle in fact exceeds the theoretical bounds by one or two orders of magnitude.

Proof of Main Results

The main result, namely Theorem 3, is proved by duality. As will be seen in a later section, c is the unique minimizer if the matrix F_T has full rank and if one can find a vector w with the two properties

1. ⟨w, v_j⟩ = sgn(c_j) for all j∈T, and

2. |⟨w, v_j⟩| < 1 for all j∉T,

where sgn(c_j) is the sign of c_j (it equals zero when c_j=0). The two conditions 1 and 2 above say that a specific dual program is feasible; this is called the exact reconstruction property in [4], see also [3]. For |T|≦S with S obeying the hypothesis of Theorem 3, F_T has full rank since δ_S<1 and thus, the proof simply consists in constructing a dual vector w; this is the object of the next section.

Exact Reconstruction Property

The applicants now examine the sparse reconstruction property and begin with coefficients ⟨w, v_j⟩ for j∉T being only small in an l₂ sense.

-   Lemma 6. (Dual sparse reconstruction property, l₂ version) Let S, S′≧1 be such that δ_S<1, and c be a real vector supported on T⊂J such that |T|≦S. Then there exists a vector w∈H such that ⟨w, v_j⟩ = c_j for all j∈T. Furthermore, there is an "exceptional set" E ⊆ J which is disjoint from T, of size at most |E|≦S′, and with the properties

$|\langle w, v_j \rangle| \leq \frac{\theta_{S,S'}}{(1-\delta_S)\sqrt{S'}} \, \|c\| \quad \text{for all } j \notin T \cup E$

and

$\left( \sum_{j \in E} \langle w, v_j \rangle^2 \right)^{1/2} \leq \frac{\theta_{S,S'}}{1-\delta_S} \cdot \|c\|.$

In addition, ∥w∥_H ≦ K·∥c∥ for some constant K>0 only depending upon δ_S.

-   Proof. Recall that F_T : l₂(T)→H is the linear transformation

$F_T c_T := \sum_{j \in T} c_j v_j,$

where c_T := (c_j)_{j∈T} (we use the subscript T in c_T to emphasize that the input is a |T|-dimensional vector), and let F*_T be the adjoint transformation

$F_T^* w := \left( \langle w, v_j \rangle \right)_{j \in T}.$

The restricted isometry property (2) gives

$1-\delta_S \leq \lambda_{\min}(F_T^* F_T) \leq \lambda_{\max}(F_T^* F_T) \leq 1+\delta_S,$

where λ_min and λ_max are the minimum and maximum eigenvalues of the positive-definite operator F*_T F_T. In particular, since δ_{|T|}<1, we see that F*_T F_T is invertible with

$\|(F_T^* F_T)^{-1}\| \leq \frac{1}{1-\delta_S}.$   (5)

Also note that

$\|F_T (F_T^* F_T)^{-1}\| \leq \frac{\sqrt{1+\delta_S}}{1-\delta_S},$

and set w∈H to be the vector

$w := F_T (F_T^* F_T)^{-1} c_T;$

it is then clear that F*_T w = c_T, i.e. ⟨w, v_j⟩ = c_j for all j∈T. In addition, ∥w∥ ≦ K·∥c_T∥ with

$K := \frac{\sqrt{1+\delta_S}}{1-\delta_S}.$

Finally, if T′ is any set in J disjoint from T with |T′|≦S′ and d_{T′}=(d_j)_{j∈T′} is any sequence of real numbers, then (3) and (5) give

$\langle F_{T'}^* w, d_{T'} \rangle_{l_2(T')} = \langle w, F_{T'} d_{T'} \rangle_H = \left\langle \sum_{j \in T} \left( (F_T^* F_T)^{-1} c_T \right)_j v_j, \; \sum_{j \in T'} d_j v_j \right\rangle_H \leq \theta_{S,S'} \cdot \left\| (F_T^* F_T)^{-1} c_T \right\| \cdot \|d_{T'}\| \leq \frac{\theta_{S,S'}}{1-\delta_S} \, \|c_T\| \cdot \|d_{T'}\|;$

since d_{T′} was arbitrary, we thus see from duality that

$\|F_{T'}^* w\|_{l_2(T')} \leq \frac{\theta_{S,S'}}{1-\delta_S} \, \|c_T\|.$

In other words,

$\left( \sum_{j \in T'} \langle w, v_j \rangle^2 \right)^{1/2} \leq \frac{\theta_{S,S'}}{1-\delta_S} \, \|c_T\| \quad \text{whenever } T' \subseteq J \setminus T \text{ and } |T'| \leq S'.$   (6)

In particular, if we set

$E := \left\{ j \in J \setminus T : |\langle w, v_j \rangle| > \frac{\theta_{S,S'}}{(1-\delta_S)\sqrt{S'}} \cdot \|c_T\| \right\},$

then |E| must obey |E|≦S′, since otherwise we could contradict (6) by taking a subset T′ of E of cardinality S′. The claims now follow.

The applicants now derive a solution with better control on the sup norm of |⟨w, v_j⟩| outside of T, by iterating away the exceptional set E (while keeping the values on T fixed).

-   Lemma 7. (Dual sparse reconstruction property, l_∞ version) Let S≧1 be such that δ_S+θ_{S,2S}<1, and c be a real vector supported on T⊂J obeying |T|≦S. Then there exists a vector w∈H such that ⟨w, v_j⟩ = c_j for all j∈T. Furthermore, w obeys

$|\langle w, v_j \rangle_H| \leq \frac{\theta_S}{(1-\delta_S-\theta_{S,2S})\sqrt{S}} \cdot \|c\| \quad \text{for all } j \notin T.$   (7)

-   Proof. We may normalize

$\left( \sum_{j \in T} c_j^2 \right)^{1/2} = \sqrt{S}.$

Write T₀:=T. Using Lemma 6, we can find a vector w₁∈H and a set T₁ ⊆ J such that

$T_0 \cap T_1 = \emptyset, \qquad |T_1| \leq S,$

$\langle w_1, v_j \rangle = c_j \quad \text{for all } j \in T_0,$

$|\langle w_1, v_j \rangle_H| \leq \frac{\theta_{S,S'}}{1-\delta_S} \quad \text{for all } j \notin T_0 \cup T_1,$

$\left( \sum_{j \in T_1} \langle w_1, v_j \rangle_H^2 \right)^{1/2} \leq \frac{\theta_S}{1-\delta_S} \sqrt{S},$

$\|w_1\|_H \leq K.$

Applying Lemma 6 iteratively gives a sequence of vectors w_{n+1}∈H and sets T_{n+1} ⊆ J for all n≧1 with the properties

$T_n \cap (T_0 \cup T_{n+1}) = \emptyset, \qquad |T_{n+1}| \leq S,$

$\langle w_{n+1}, v_j \rangle_H = \langle w_n, v_j \rangle_H \quad \text{for all } j \in T_n,$

$\langle w_{n+1}, v_j \rangle_H = 0 \quad \text{for all } j \in T_0,$

$|\langle w_{n+1}, v_j \rangle_H| \leq \frac{\theta_S}{1-\delta_S} \left( \frac{\theta_{S,2S}}{1-\delta_S} \right)^n \quad \text{for all } j \notin T_0 \cup T_n \cup T_{n+1},$

$\left( \sum_{j \in T_{n+1}} \langle w_{n+1}, v_j \rangle_H^2 \right)^{1/2} \leq \frac{\theta_S}{1-\delta_S} \left( \frac{\theta_{S,2S}}{1-\delta_S} \right)^n \sqrt{S},$

$\|w_{n+1}\|_H \leq \left( \frac{\theta_S}{1-\delta_S} \right)^{n-1} K.$

By hypothesis, we have

$\frac{\theta_{S,2S}}{1-\delta_S} < 1.$

Thus if we set

$w := \sum_{n=1}^{\infty} (-1)^{n-1} w_n,$

then the series is absolutely convergent and, therefore, w is a well-defined vector in H. We now study the coefficients

$\langle w, v_j \rangle_H = \sum_{n=1}^{\infty} (-1)^{n-1} \langle w_n, v_j \rangle_H$   (8)

for j∈J.

Consider first j∈T₀. It follows from the construction that ⟨w₁, v_j⟩ = c_j and ⟨w_n, v_j⟩ = 0 for all n>1, and hence ⟨w, v_j⟩ = c_j for all j∈T₀.

Second, fix j with j∉T₀ and let I_j := {n≧1 : j∈T_n}. Since T_n and T_{n+1} are disjoint, we see that the integers in the set I_j are spaced at least two apart. Now if n∈I_j, then by definition j∈T_n and, therefore,

$\langle w_{n+1}, v_j \rangle = \langle w_n, v_j \rangle.$

In other words, the n-th and (n+1)-th terms in (8) cancel each other out. Thus we have

$\langle w, v_j \rangle_H = \sum_{\substack{n \geq 1 \\ n, \, n-1 \notin I_j}} (-1)^{n-1} \langle w_n, v_j \rangle_H.$

On the other hand, if n and n−1 are both outside I_j (with n≧1), then j∉T_n∪T_{n−1} and

$|\langle w_n, v_j \rangle| \leq \frac{\theta_S}{1-\delta_S} \left( \frac{\theta_{S,2S}}{1-\delta_S} \right)^{n-1},$

which by the triangle inequality and the geometric series formula gives

$\left| \sum_{\substack{n \geq 1 \\ n, \, n-1 \notin I_j}} (-1)^{n-1} \langle w_n, v_j \rangle_H \right| \leq \frac{\theta_S}{1-\delta_S-\theta_{S,2S}}.$

In conclusion,

$\left| \langle w, v_j \rangle_H - 1_{0 \in I_j} \langle w_0, v_j \rangle_H \right| \leq \frac{\theta_S}{1-\delta_S-\theta_{S,2S}},$

and since |T|≦S, the claim follows.

Lemma 7 actually solves the dual recovery problem. Indeed, the applicants' result states that one can find a vector w∈H obeying both properties 1 and 2 stated at the beginning of the section. To see why 2 holds, observe that $\|\mathrm{sgn}(c)\| = \sqrt{|T|} \leq \sqrt{S}$ and, therefore, (7) gives for all j∉T

$|\langle w, v_j \rangle_H| \leq \frac{\theta_S}{1-\delta_S-\theta_{S,2S}} \cdot \sqrt{\frac{|T|}{S}} \leq \frac{\theta_S}{1-\delta_S-\theta_{S,2S}} < 1,$

provided that (4) holds.

Proof of Theorem 3

Observe first that standard convex arguments give that there exists at least one minimizer d=(d_j)_{j∈J} to the problem (P₁). We need to prove that d=c. Since c obeys the constraints of this problem, d obeys

$\|d\|_{l_1} \leq \|c\|_{l_1} = \sum_{j \in T} |c_j|.$   (9)

Now take a w obeying properties 1 and 2 (see the remark following Lemma 7). Using the fact that the inner product ⟨w, v_j⟩ is equal to the sign of c_j on T and has absolute value strictly less than one on the complement, we then compute

$\|d\|_{l_1} = \sum_{j \in T} |c_j + (d_j - c_j)| + \sum_{j \notin T} |d_j| \geq \sum_{j \in T} \mathrm{sgn}(c_j)(c_j + (d_j - c_j)) + \sum_{j \notin T} d_j \langle w, v_j \rangle_H = \sum_{j \in T} |c_j| + \sum_{j \in T} (d_j - c_j) \langle w, v_j \rangle_H + \sum_{j \notin T} d_j \langle w, v_j \rangle_H = \sum_{j \in T} |c_j| + \left\langle w, \sum_{j \in J} d_j v_j - \sum_{j \in T} c_j v_j \right\rangle_H = \sum_{j \in T} |c_j| + \langle w, f - f \rangle_H = \sum_{j \in T} |c_j|.$

Comparing this with (9), we see that all the inequalities in the above computation must in fact be equalities. Since |⟨w, v_j⟩| was strictly less than 1 for all j∉T, this in particular forces d_j=0 for all j∉T. Thus

$\sum_{j \in T} (d_j - c_j) v_j = f - f = 0.$

Applying (2) (and noting from the hypothesis that δ_S<1), we conclude that d_j=c_j for all j∈T. Thus d=c as claimed. This concludes the proof of the applicants' theorem.

It is likely that one may push the condition (4) a little further. The key idea is as follows. Each vector w_n in the iteration scheme used to prove Lemma 7 was designed to annihilate the influence of w_{n−1} on the exceptional set T_{n−1}. But annihilation is too strong of a goal. It would be just as suitable to design w_n to moderate the influence of w_{n−1} enough so that the inner product with elements in T_{n−1} is small rather than zero.

Approximate Orthogonality

Lemma 1 gives control of the size of the principal angle between subspaces of dimension S and S′ respectively. This is useful because it allows one to guarantee exact reconstruction from the knowledge of the δ numbers only.

-   Proof of Lemma 1. The applicants first show that θ_{S,S′} ≦ δ_{S+S′}. By homogeneity it will suffice to show that

$\left| \left\langle \sum_{j \in T} c_j v_j, \; \sum_{j' \in T'} c'_{j'} v_{j'} \right\rangle_H \right| \leq \delta_{S+S'}$

whenever |T|≦S, |T′|≦S′, T, T′ are disjoint, and

$\sum_{j \in T} c_j^2 = \sum_{j' \in T'} (c'_{j'})^2 = 1.$

Now (2) gives

$2(1-\delta_{S+S'}) \leq \left\| \sum_{j \in T} c_j v_j + \sum_{j' \in T'} c'_{j'} v_{j'} \right\|_H^2 \leq 2(1+\delta_{S+S'})$

together with

$2(1-\delta_{S+S'}) \leq \left\| \sum_{j \in T} c_j v_j - \sum_{j' \in T'} c'_{j'} v_{j'} \right\|_H^2 \leq 2(1+\delta_{S+S'}),$

and the claim now follows from the parallelogram identity.

It remains to show that δ_{S+S′} ≦ θ_{S,S′} + max(δ_S, δ_{S′}). Again by homogeneity, it suffices to establish that

$\left| \left\langle \sum_{j \in \tilde{T}} c_j v_j, \; \sum_{j' \in \tilde{T}} c_{j'} v_{j'} \right\rangle_H - 1 \right| \leq \max(\delta_S, \delta_{S'}) + \theta_{S,S'}$

whenever |T̃| ≦ S+S′ and

$\sum_{j \in \tilde{T}} c_j^2 = 1.$

To prove this property, we partition T̃ as T̃=T∪T′ where |T|≦S and |T′|≦S′, and write

$\sum_{j \in T} c_j^2 = \alpha \quad \text{and} \quad \sum_{j \in T'} c_j^2 = 1-\alpha.$

(2) together with (3) give

$(1-\delta_S)\,\alpha \leq \left\langle \sum_{j \in T} c_j v_j, \; \sum_{j' \in T} c_{j'} v_{j'} \right\rangle_H \leq (1+\delta_S)\,\alpha,$

$(1-\delta_{S'})(1-\alpha) \leq \left\langle \sum_{j \in T'} c_j v_j, \; \sum_{j' \in T'} c_{j'} v_{j'} \right\rangle_H \leq (1+\delta_{S'})(1-\alpha),$

$\left| \left\langle \sum_{j \in T} c_j v_j, \; \sum_{j' \in T'} c_{j'} v_{j'} \right\rangle_H \right| \leq \theta_{S,S'} \, \alpha^{1/2} (1-\alpha)^{1/2}.$

Hence

$\left| \left\langle \sum_{j \in \tilde{T}} c_j v_j, \; \sum_{j' \in \tilde{T}} c_{j'} v_{j'} \right\rangle_H - 1 \right| \leq \delta_S \alpha + \delta_{S'}(1-\alpha) + 2\theta_{S,S'} \alpha^{1/2} (1-\alpha)^{1/2} \leq \max(\delta_S, \delta_{S'}) + \theta_{S,S'},$

as claimed.

Gaussian Random Matrices

In this section, the applicants argue that with overwhelming probability, Gaussian random matrices have "good" isometry constants. Consider a p by m matrix F whose entries are i.i.d. Gaussian with mean zero and variance 1/p and let T be a subset of the columns. We wish to study the extremal eigenvalues of F*_T F_T. Following upon the work of Marchenko and Pastur [18], Geman [15] and Silverstein [22] (see also [1]) proved that

$\lambda_{\min}(F_T^* F_T) \to (1-\sqrt{\gamma})^2 \quad \text{a.s.},$

$\lambda_{\max}(F_T^* F_T) \to (1+\sqrt{\gamma})^2 \quad \text{a.s.},$

in the limit where p and |T|→∞ with |T|/p→γ≦1. In other words, this says that, loosely speaking and in the limit of large p, the restricted isometry constant δ(F_T) for a fixed T behaves like

$1-\delta(F_T) \leq \lambda_{\min}(F_T^* F_T) \leq \lambda_{\max}(F_T^* F_T) \leq 1+\delta(F_T), \qquad \delta(F_T) \approx 2\sqrt{|T|/p} + |T|/p.$
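This limiting behavior is easy to check numerically (a sketch under the stated normalization—entries of variance 1/p—with arbitrary sizes; assuming NumPy):

```python
import numpy as np

# Compare the extremal eigenvalues of F_T* F_T with the limiting values
# (1 -/+ sqrt(gamma))^2, where gamma = |T|/p.
rng = np.random.default_rng(0)
p, size_T = 2000, 200                                  # gamma = 0.1
F_T = rng.standard_normal((p, size_T)) / np.sqrt(p)    # i.i.d. N(0, 1/p)
eigs = np.linalg.eigvalsh(F_T.T @ F_T)                 # sorted ascending
gamma = size_T / p
print(eigs[0], (1 - np.sqrt(gamma)) ** 2)              # lambda_min vs limit
print(eigs[-1], (1 + np.sqrt(gamma)) ** 2)             # lambda_max vs limit
```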

Restricted isometry constants must hold for all sets T of cardinality less than or equal to S, and the applicants shall make use of concentration inequalities to develop such a uniform bound. Note that for T′⊂T, one obviously has

$\lambda_{\min}(F_T^* F_T) \leq \lambda_{\min}(F_{T'}^* F_{T'}) \leq \lambda_{\max}(F_{T'}^* F_{T'}) \leq \lambda_{\max}(F_T^* F_T)$

and, therefore, attention may be restricted to matrices of size S. Now, there are large deviation results about the singular values of F_T [23]. For example, letting σ_max(F_T) := λ_max(F*_T F_T)^{1/2} be the largest singular value of F_T, and defining σ_min(F_T) similarly, Ledoux [18] applies the concentration inequality for Gaussian measures, and for a fixed t>0, obtains the deviation bounds

$P\left( \sigma_{\max}(F_T) > 1 + \sqrt{|T|/p} + o(1) + t \right) \leq e^{-pt^2/2},$   (10)

$P\left( \sigma_{\min}(F_T) < 1 - \sqrt{|T|/p} - o(1) - t \right) \leq e^{-pt^2/2};$   (11)

here, o(1) is a small term tending to zero as p→∞ and which can be calculated explicitly, see [14]. For example, this last reference shows that one can select the o(1) term in (10) as

$\frac{1}{2p^{1/3}} \cdot \gamma^{1/6} (1+\sqrt{\gamma})^{2/3}.$

-   Lemma 8. Put r:=S/m and set

$f(r) := \sqrt{m/p} \cdot \left( \sqrt{r} + \sqrt{2H(r)} \right),$

where H is the entropy function H(q) := −q log q − (1−q) log(1−q) defined for 0<q<1. For each ε>0, the restricted isometry constant δ_S of a p by m Gaussian matrix F obeys

$P\left( 1 + \delta_S > \left[ 1 + (1+\varepsilon) f(r) \right]^2 \right) \leq 2 \cdot e^{-mH(r)\varepsilon/2}.$

-   Proof. As discussed above, we may restrict the applicants' attention to sets T such that |T|=S. Denote by η_p the o(1)-term appearing in either (10) or (11). Put λ_max := λ_max(F*_T F_T) for short, and observe that

$P\left( \sup_{T : |T|=S} \lambda_{\max} > \left( 1+\sqrt{S/p}+\eta_p+t \right)^2 \right) \leq \left| \left\{ T : |T|=S \right\} \right| \cdot P\left( \lambda_{\max} > \left( 1+\sqrt{S/p}+\eta_p+t \right)^2 \right) \leq \binom{m}{S} e^{-pt^2/2}.$

From Stirling's approximation log m! = m log m − m + O(log m), we have the well-known formula

$\log \binom{m}{S} = m H(r) + O(\log m),$

which gives

$P\left( \sup_{T : |T|=S} \lambda_{\max} > \left( 1+\sqrt{S/p}+\eta_p+t \right)^2 \right) \leq e^{m H(r)} \cdot e^{O(\log m)} \cdot e^{-pt^2/2}.$

The exact same argument applied to the smallest eigenvalues yields

$P\left( \inf_{T : |T|=S} \lambda_{\min} < \left( 1-\sqrt{S/p}-\eta_p-t \right)^2 \right) \leq e^{m H(r)} \cdot e^{O(\log m)} \cdot e^{-pt^2/2}.$

Fix $\eta_p + t = (1+\varepsilon) \cdot \sqrt{m/p} \cdot \sqrt{2H(r)}$. Assume now that m and p are large enough so that

$\eta_p \leq \frac{\varepsilon}{2} \cdot \sqrt{m/p} \cdot \sqrt{2H(r)}.$

Then

$P\left( \sup_{T : |T|=S} \lambda_{\max} > \left( 1+\sqrt{S/p}+(1+\varepsilon) \cdot \sqrt{m/p} \cdot \sqrt{2H(r)} \right)^2 \right) \leq e^{-m H(r)\varepsilon/2},$

where we used the fact that the term O(log m) is less than mεH(r)/2 for sufficiently large m. The same bound holds for the minimum eigenvalues and the claim follows.

Ignoring the ε's, Lemma 8 states that with overwhelming probability

$\delta_S < -1 + [1 + f(r)]^2.$

A similar conclusion holds for δ_{2S} and δ_{3S} and, therefore, we have established that

$\delta_S + \delta_{2S} + \delta_{3S} < \rho_{p/m}(r), \qquad \rho_{p/m}(r) = \sum_{j=1}^{3} \left( -1 + \left[ 1 + f(jr) \right]^2 \right),$   (12)

with very high probability. In conclusion, Lemma 1 shows that the hypothesis of the applicants' main theorem holds provided that the ratio r=S/m is so small that ρ_{p/m}(r)<1. In other words, in the limit of large samples, S/m may be taken as any value obeying ρ_{p/m}(S/m)<1, which we used to give numerical values in Theorem 5. FIG. 1 graphs the function ρ_{p/m}(r) for several values of the ratio p/m.
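The numerical values quoted in Theorem 5 can be reproduced from (12) in a few lines (a sketch assuming NumPy; in the large-sample limit the η_p and O(log m) corrections are dropped, so only the ratio p/m enters):

```python
import numpy as np

def H(q):
    """Entropy function H(q) = -q log q - (1 - q) log(1 - q)."""
    return -q * np.log(q) - (1 - q) * np.log(1 - q)

def rho(r, m_over_p):
    """rho_{p/m}(r) = sum_{j=1..3} (-1 + [1 + f(j r)]^2), where
    f(r) = sqrt(m/p) (sqrt(r) + sqrt(2 H(r)))."""
    f = lambda x: np.sqrt(m_over_p) * (np.sqrt(x) + np.sqrt(2 * H(x)))
    return sum(-1 + (1 + f(j * r)) ** 2 for j in (1, 2, 3))

def r_star(m_over_p, lo=1e-8, hi=0.33, iters=100):
    """Largest r with rho_{p/m}(r) < 1, located by bisection."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if rho(mid, m_over_p) < 1 else (lo, mid)
    return lo

# m/p = 4/3, 3/2, 2 correspond to p/m = 3/4, 2/3, 1/2 in Theorem 5.
for m_over_p in (4 / 3, 3 / 2, 2.0):
    print(m_over_p, r_star(m_over_p))
```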

Numerical Experiments

This section investigates the practical ability of l₁ to decode an object f∈R^n from corrupted encoded data y=Af+e, y∈R^m (or equivalently, to recover the sparse vector of errors e∈R^m from the underdetermined system of equations Fe=z∈R^{m−n}). The goal here is to evaluate empirically the location of the breakpoint so as to get an accurate sense of the performance one might expect in practice. In order to do this, the applicants performed a series of experiments designed as follows (a code sketch of one trial follows the list):

1.  select n (the size of the input signal) and m so that, with the same notations as before, A is an m by n matrix; sample A with independent Gaussian entries;
2.  select S as a percentage of m;
3.  select a support set T of size |T|=S uniformly at random, and sample a vector e on T with independent and identically distributed Gaussian entries (just as in [5], the results presented here do not seem to depend on the actual distribution used to sample the errors);
4.  perform the noisy encoding ỹ=Ax+e of the data x (the choice of x does not matter, as is clear from the discussion, and here x is also selected at random), apply the decoding procedure (P₁′) and obtain x*;
5.  compare x to x*;
6.  repeat 100 times for each S and A;
7.  repeat for various sizes of n and m.
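A sketch of one trial of this procedure (assuming NumPy and reusing the l1_decode helper from the earlier LP sketch; the sizes are arbitrary):

```python
import numpy as np

def trial(n=128, frac_corrupted=0.10, seed=0):
    """One trial: Gaussian code, random sparse errors, decode via (P1')."""
    rng = np.random.default_rng(seed)
    m = 2 * n
    A = rng.standard_normal((m, n))            # step 1: m by n Gaussian code
    S = int(frac_corrupted * m)                # step 2: S as a percentage of m
    x = rng.standard_normal(n)                 # the choice of x does not matter
    e = np.zeros(m)
    T = rng.choice(m, size=S, replace=False)   # step 3: random support T
    e[T] = rng.standard_normal(S)              # i.i.d. Gaussian errors on T
    y = A @ x + e                              # step 4: noisy encoding
    x_star = l1_decode(A, y)                   # decode via (P1')
    return np.allclose(x, x_star, atol=1e-6)   # step 5: exact recovery?

rate = np.mean([trial(seed=k) for k in range(100)])   # step 6: 100 repetitions
print(f"empirical success rate: {rate:.2f}")
```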

The results are presented in FIG. 2 and FIG. 3. FIG. 2 examines the situation in which the length of the code is twice that of the input vector, m=2n, for m=512 and m=1024. The applicants' experiments show that one recovers the input vector all the time as long as the fraction of the corrupted entries is below 17%. This holds for m=512 (FIG. 2(a)) and m=1024 (FIG. 2(b)). In FIG. 3, the applicants investigate how these results change as the length of the codewords increases compared to the length of the input, and examine the situation in which m=4n, with m=512. The applicants' experiments show that one decodes the input vector all the time as long as the fraction of the corrupted entries is below 34%.

Optimal Signal Recovery

The applicants' recent work [4] developed a set of ideas showing that it is surprisingly possible to reconstruct interesting classes of signals accurately from highly incomplete measurements. The results in the present application are inspired by and improve upon this earlier work, and the applicants now elaborate on this connection. Suppose one wishes to reconstruct an object α∈R^m from the K linear measurements

$y_k = \langle \alpha, \varphi_k \rangle, \quad k=1, \ldots, K, \quad \text{i.e.} \quad y = F\alpha,$   (13)

with φ_k equal to the k-th row of the matrix F. Of special interest is the vastly underdetermined case, K<<m, where there are many more unknowns than observations. The applicants choose to formulate the problem abstractly but, for concreteness, one might think of α as the coefficients α:=Ψ*f of a digital signal or image f in some nice orthobasis, e.g. a wavelet basis, so that the information about the signal is of the form y=Fα=FΨ*f.

Suppose now that the object of interest is compressible in the sense that the reordered entries of α decay like a power-law; concretely, suppose that the entries of α, rearranged in decreasing order of magnitude, |α|_(1) ≧ |α|_(2) ≧ . . . ≧ |α|_(m), obey

$|\alpha|_{(k)} \leq B \cdot k^{-s}$   (14)

for some s≧1. The applicants will denote by F_s(B) the class of all signals α∈R^m obeying (14). The claim is that it is possible to reconstruct compressible signals from only a small number of random measurements.

-   Theorem 9. Let F be the measurement matrix as in (13) and consider the solution α^# to

$\min_{\tilde{\alpha} \in R^{m}} \|\tilde{\alpha}\|_{l_1} \quad \text{subject to} \quad F\tilde{\alpha} = y.$

Let S≦K be such that δ_S + 2θ_S + θ_{S,2S} < 1 and set λ := K/S. Then α^# obeys

$\sup_{\alpha \in F_s(B)} \|\alpha - \alpha^{\#}\| \leq C \cdot (K/\lambda)^{-(s-1/2)}.$   (15)

To appreciate the content of the theorem, suppose one would have available an oracle letting us know which coefficients α_k, 1≦k≦m, are large (e.g. in the scenario we considered earlier, the oracle would tell us which wavelet coefficients of f are large). Then we would acquire information about the K largest coefficients and obtain a truncated version α_K of α obeying

$\|\alpha - \alpha_K\| \geq c \cdot K^{-(s-1/2)}$

for generic elements taken from F_s(B). Now (15) says that not knowing anything about the location of the largest coefficients, one can essentially obtain the same approximation error by non-adaptive sampling, provided the number of measurements be increased by a factor λ. The larger S, the smaller the oversampling factor, and hence the connection with the decoding problem. Such considerations make clear that Theorem 9 supplies a very concrete methodology for recovering a compressible object from limited measurements and, as such, it may have a significant bearing on many fields of science and technology. The applicants refer the reader to [4] and [10] for a discussion of its implications.
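The approximation rate quoted above is easy to verify numerically for a power-law sequence (a sketch assuming NumPy; the values B=1 and s=1.5 are arbitrary choices): since ∥α−α_K∥² = Σ_{k>K} k^{−2s} ≈ K^{−(2s−1)}/(2s−1), the truncation error scales like K^{−(s−1/2)}.

```python
import numpy as np

# Best-K-term approximation error of |alpha|_(k) = k^{-s} versus the
# predicted rate K^{-(s - 1/2)} / sqrt(2 s - 1).
s, m = 1.5, 10**6
alpha = np.arange(1, m + 1, dtype=float) ** (-s)
for K in (10, 100, 1000):
    err = np.sqrt(np.sum(alpha[K:] ** 2))      # drop all but the K largest
    print(K, err, K ** (-(s - 0.5)) / np.sqrt(2 * s - 1))
```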

Suppose for example that F is a Gaussian random matrix as discussed in an earlier section. We will assume the same special normalization so that the variance of each individual entry is equal to 1/K. Calculations identical to those from earlier sections give that, with overwhelming probability, F obeys the hypothesis of the theorem provided that

$S \leq K/\lambda, \qquad \lambda = \rho \cdot \log(m/K),$

for some positive constant ρ>0. Now consider the statement of the theorem; there is a way to invoke linear programming and obtain a reconstruction based upon O(K log(m/K)) measurements only, which is at least as good as that one would achieve by knowing all the information about f and selecting the K largest coefficients. In fact, this is an optimal statement as (15) correctly identifies the minimum number of measurements needed to obtain a given precision. In short, it is impossible to obtain a precision of about K^{−(s−1/2)} using substantially fewer than K log(m/K) measurements, see [4, 10].

Theorem 9 is stronger than the applicants' former result, namely Theorem 1.4 in [4]. To see why this is true, recall the former claim: [4] introduced two conditions, the uniform uncertainty principle (UUP) and the exact reconstruction principle (ERP). In a nutshell, a random matrix F obeys the UUP with oversampling factor λ if F obeys

$\delta_S \leq 1/2, \qquad S = \rho \cdot K/\lambda,$   (16)

with probability at least 1−O(N^{−γ/ρ}) for some fixed positive constant γ>0. Second, a measurement matrix F obeys the ERP with oversampling factor λ if, for each fixed subset T of size |T|≦S with S as in (16), and each "sign" vector c defined on T, there exists with the same overwhelmingly large probability a vector w∈H with the following properties:

1. ⟨w, v_j⟩ = c_j for all j∈T, and

2. |⟨w, v_j⟩| < ½ for all j not in T.

Note that these are the conditions 1 and 2 studied in previous sections, except for the ½ factor on the complement of T. Fix α∈F_s(B). In [4] it was argued that if a random matrix obeyed the UUP and the ERP, both with oversampling factor λ, then

$\|\alpha - \alpha^{\#}\| \leq C \cdot (K/\lambda)^{-(s-1/2)},$

with the inequality holding with the same probability as before. Against this background, several comments are now in order:

-   First, the new statement is more general as it applies to all matrices, not just random matrices.

-   Second, whereas the applicants' previous statement argued that for each α∈R^m, one would be able—with high probability—to reconstruct α accurately, it did not say anything about the worst case error for a fixed measurement matrix F. This is an instance where the order of the quantifiers plays a role. Do we need different F's for different objects? Theorem 9 answers this question unambiguously; the same F will provide an optimal reconstruction for all the objects in the class.

-   Third, Theorem 9 says that the ERP condition is redundant, and hence the hypothesis may be easier to check in practice. In addition, eliminating the ERP isolates the real reason for success as it ties everything down to the UUP. In short, the ability to recover an object from limited measurements depends on how close F is to an orthonormal system, but only when restricting attention to sparse linear combinations of columns.

The applicants will not prove this theorem as it is a minor modification of that of Theorem 1.4 in the aforementioned reference. The key point is to observe that if F obeys the hypothesis of the applicants' theorem, then by definition F obeys the UUP with probability one, but F also obeys the ERP, again with probability one, as this is the content of Lemma 7. Hence both the UUP and ERP hold and, therefore, the conclusion of Theorem 9 follows. (The fact that the ERP actually holds for all sign vectors of size less than S is the reason why (15) holds uniformly over all elements taken from F_s(B), see [4].)

Discussion

Improvements

More elaborate arguments can yield versions of Theorem 5 with tighter bounds. Immediately following the proof of Lemma 7, we already remarked that one might slightly improve the condition (4) at the expense of considerable complications. The applicants will now discuss improvements in this direction using tools from Random Matrix Theory.

The applicants' main hypothesis in (4) reads δ_S + θ_S + θ_{S,2S} < 1, but in order to reduce the problem to the study of those δ numbers (and use known results), the applicants' analysis actually relied upon the more stringent condition δ_S + δ_{2S} + δ_{3S} < 1 instead, since

$\delta_S + \theta_S + \theta_{S,2S} \leq \delta_S + \delta_{2S} + \delta_{3S}.$

This introduces a gap. Consider a fixed set T of size |T|=S. Using the notation of previous sections, the applicants argued that

$\delta(F_T) \approx 2\sqrt{S/p} + S/p,$

and developed a large deviation bound to quantify the departure from the right hand-side. Now let T and T′ be two disjoint sets of respective sizes S and S′ and consider θ(F_T, F_{T′}), which is the cosine of the principal angle between the two random subspaces spanned by the columns of F_T and F_{T′} respectively; formally

$\theta(F_T, F_{T'}) := \sup \, \langle u, u' \rangle, \quad u \in \mathrm{span}(F_T), \ u' \in \mathrm{span}(F_{T'}), \ \|u\| = \|u'\| = 1.$

The applicants remark that this quantity plays an important role in statistical analysis because of its use in testing the significance of correlations between two sets of measurements; compare the literature on Canonical Correlation Analysis [20]. Among other things, it is known [25] that

$\theta\left( F_{T}, F_{T'} \right) \rightarrow \sqrt{\gamma\left( 1 - \gamma' \right)} + \sqrt{\gamma'\left( 1 - \gamma \right)} \quad \text{a.s.}$

as p→∞ with S/p→γ and S′/p→γ′. In other words, whereas we used the limiting behaviors

$\delta\left( F_{2T} \right) \rightarrow 2\sqrt{2\gamma} + 2\gamma, \qquad \delta\left( F_{3T} \right) \rightarrow 2\sqrt{3\gamma} + 3\gamma,$

there is a chance one might employ instead

$\theta\left( F_{T}, F_{T'} \right) \rightarrow 2\sqrt{\gamma\left( 1 - \gamma \right)}, \qquad \theta\left( F_{T}, F_{T'} \right) \rightarrow \sqrt{\gamma\left( 1 - 2\gamma \right)} + \sqrt{2\gamma\left( 1 - \gamma \right)}$

for |T| = |T′| = S and |T′| = 2|T| = 2S respectively, which is better; a numerical sketch comparing these constants appears below. Just as in previous sections, one might then look for concentration inequalities transforming this limiting behavior into corresponding large deviation inequalities. Recent work of Johnstone and his colleagues [17] might be of help here.
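
The comparison can be checked numerically. The following sketch is illustrative and not part of the disclosure: it draws F from the Gaussian ensemble with i.i.d. N(0, 1/p) entries (the values of p and S are assumptions made for the example), compares the empirical δ(F_(T)) and θ(F_(T), F_(T′)) with the limits quoted above, and prints the competing δ-based and θ-based constants at γ = 0.05:

```python
import numpy as np

# Illustrative Monte Carlo sketch (not part of the disclosure): draw F from
# the Gaussian ensemble with i.i.d. N(0, 1/p) entries and compare empirical
# values of delta(F_T) and theta(F_T, F_T') with the limits quoted above.
rng = np.random.default_rng(0)
p, S = 4000, 200                    # gamma = S/p = 0.05; take |T'| = 2|T| = 2S
F = rng.normal(0.0, 1.0 / np.sqrt(p), size=(p, 3 * S))
F_T, F_Tp = F[:, :S], F[:, S:]      # disjoint column sets T and T'

# delta(F_T): worst deviation of an eigenvalue of F_T^T F_T from 1;
# the approximation above predicts 2*sqrt(S/p) + S/p.
eig = np.linalg.eigvalsh(F_T.T @ F_T)
print("delta(F_T):", max(eig[-1] - 1, 1 - eig[0]),
      "vs", 2 * np.sqrt(S / p) + S / p)

# theta(F_T, F_T'): cosine of the principal angle between the column spans,
# computed as the largest singular value of Q_T^T Q_T' (orthonormal bases Q).
Q_T = np.linalg.qr(F_T)[0]
Q_Tp = np.linalg.qr(F_Tp)[0]
g, gp = S / p, 2 * S / p
print("theta(F_T, F_T'):", np.linalg.svd(Q_T.T @ Q_Tp, compute_uv=False)[0],
      "vs", np.sqrt(g * (1 - gp)) + np.sqrt(gp * (1 - g)))

# The gap discussed in the text, at gamma = 0.05:
print("delta-based limits:", 2 * np.sqrt(2 * g) + 2 * g, 2 * np.sqrt(3 * g) + 3 * g)
print("theta-based limits:", 2 * np.sqrt(g * (1 - g)),
      np.sqrt(g * (1 - 2 * g)) + np.sqrt(2 * g * (1 - g)))
```

At γ = 0.05, the δ-based constants come to about 0.73 and 0.92, while the θ-based constants come to about 0.44 and 0.52, which quantifies the gap.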

Finally, tighter large deviation bounds might exist, together with more clever strategies to derive uniform bounds (valid for all T of size less than S) from individual bounds (valid for a single T). With this in mind, it is interesting to note that the applicants' approach hits a limit, as

$\begin{matrix}{{\liminf\limits_{\underset{{S/m} \rightarrow r}{S \rightarrow \infty}}\left( {\delta_{S} + \theta_{S} + \theta_{S,{2S}}} \right)} \geq {J\left( \frac{m}{p} \cdot r \right)},} & {(17)}\end{matrix}$

where $J(r) := 2\sqrt{r} + r + \left( {2 + \sqrt{2}} \right)\sqrt{r\left( {1 - r} \right)} + \sqrt{r\left( {1 - {2r}} \right)}$. Since J(r) is greater than 1 if and only if r exceeds about 2.36%, one would certainly need new ideas to improve Theorem 5 beyond a cut-off point in the range of about 2%. The lower limit (17) is probably not sharp since it does not explicitly take into account the ratio between m and p; at best, it might serve as an indication of the limiting behavior when the ratio p/m is not too small.
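
As a quick numerical check (an illustrative sketch, not part of the disclosure), the cut-off where J(r) = 1 can be located by bisection, since J is increasing on the interval of interest:

```python
import numpy as np

# Illustrative check (not part of the disclosure): locate the cut-off where
# J(r) = 1 by bisection; J is increasing on (0, 1/4], so bisection applies.
def J(r):
    return (2 * np.sqrt(r) + r
            + (2 + np.sqrt(2)) * np.sqrt(r * (1 - r))
            + np.sqrt(r * (1 - 2 * r)))

lo, hi = 1e-6, 0.25
for _ in range(60):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if J(mid) < 1 else (lo, mid)

print(f"J(r) = 1 at r ~ {0.5 * (lo + hi):.4f}")   # ~ 0.0236, i.e. about 2.36%
```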

Other Coding Matrices

The present application introduced general results stating that it is possible to correct for errors by l₁-minimization. We then explained how the results specialize in the case where the coding matrix A is sampled from the Gaussian ensemble. It is clear, however, that an expert in these methods could use other matrices and still obtain similar results; namely, that (P₁′) recovers f exactly provided that the number of corrupted entries does not exceed ρ·m. In fact, the applicants' previous work suggests that partial Fourier matrices would enjoy similar properties [5, 4]. Other candidates might be the so-called noiselets of Coifman, Geshwind and Meyer [7]. These alternatives might be of great practical interest because they would come with fast algorithms for applying A or A* to an arbitrary vector g and would hence speed up the computations to find the l₁-minimizer, as sketched below.
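
To illustrate the point about fast algorithms, the sketch below builds a hypothetical partial Fourier coding matrix from n columns of the m-by-m DFT matrix; the column choice and sizes are assumptions made for the example, not a construction from the disclosure. Both A and A* then cost O(m log m) to apply via the FFT instead of O(mn):

```python
import numpy as np

# Hypothetical structured coding matrix: A is formed from n columns of the
# m x m DFT matrix, so A g and A* g can be applied via the FFT.
rng = np.random.default_rng(3)
m, n = 512, 128
cols = np.sort(rng.choice(m, size=n, replace=False))  # DFT columns kept

def apply_A(x):                      # A x: place x on the kept columns, FFT
    v = np.zeros(m, dtype=complex)
    v[cols] = x
    return np.fft.fft(v)

def apply_A_star(y):                 # A* y: inverse FFT, restrict to columns
    return (m * np.fft.ifft(y))[cols]

# Sanity check against the explicit matrix (only feasible at small sizes).
x = rng.normal(size=n)
A = np.exp(-2j * np.pi * np.outer(np.arange(m), cols) / m)
print(np.allclose(apply_A(x), A @ x))   # True
```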

CONCLUSIONS

The present disclosure is about the classical error correcting problem which is frequently discussed in communication and coding theory. The problem is as follows: a vector f ∈ R^(n) (i.e. a sequence f = (f₁, f₂, . . . , f_(n)) of n real numbers) is encoded and then transmitted using a linear code, i.e. a collection of codewords which are vectors a₁, a₂, . . . , a_(n) ∈ R^(m), the columns of a matrix A. Given a vector f ∈ R^(n) (the “plaintext”) we can then generate a vector Af in R^(m) (the “ciphertext”). A recurrent problem with real communication or storage devices is that some of the entries of the ciphertext Af may become corrupted; when this is the case, the corrupted entries are unreliable and may have nothing to do with the original values. In general, it is not known which entries are affected, nor is it known how they are affected. The applicants' disclosure shows that no matter what the corruption is like, it is provably possible to decode the plaintext from the corrupted ciphertext by solving a convenient linear program, with success guaranteed as long as the fraction of corrupted entries is not excessively large. The algorithm proceeds by solving the convex optimization program

$\min\limits_{x \in R^{n}}\left\| y - Ax \right\|_{l_{1}}$

where y is the vector of corrupted encoded data and

${x}_{l_{1}}:={\sum\limits_{i}{{x_{i}}.}}$

The following are two key features of the decoding algorithm presented in this disclosure:

-   1. Tractability. The optimization program is equivalent to a linear program (1). This observation is of practical significance since it enables large scale decoding problems to be solved rapidly and efficiently using standard off-the-shelf optimization software; see the sketch after this list.
-   2. Performance. The applicants performed numerical experiments which show that in practice, the decoding algorithm of the present disclosure works extremely well when the coding matrix is generated at random. Suppose, for example, that the number of codewords m is 4 times the length of the plaintext (m = 4n). The applicants' numerical experiments have shown that the plaintext is recovered exactly all the time as long as the fraction of corrupted entries is below about 34% of the size of the ciphertext. (In contrast, it is known that accurate decoding is impossible if the corruption rate is 50% or higher.)
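
To illustrate item 1, here is a minimal decoding sketch; it is not the disclosure's implementation, and the sizes, the 20% corruption level, and the use of scipy's linprog are assumptions made for the example. The l₁ objective is recast as a linear program by introducing slack variables t with −t ≤ y − Ax ≤ t and minimizing Σᵢ tᵢ:

```python
import numpy as np
from scipy.optimize import linprog

# Minimal l1-decoding sketch: solve min_x ||y - Ax||_l1 as a linear program.
rng = np.random.default_rng(2)
n, m = 128, 512                       # plaintext length; codeword length m = 4n
A = rng.normal(size=(m, n))           # random Gaussian coding matrix
f = rng.normal(size=n)                # plaintext

y = A @ f                             # ciphertext
bad = rng.choice(m, size=m // 5, replace=False)
y[bad] += 10.0 * rng.normal(size=bad.size)    # corrupt 20% of the entries

# Variables z = [x, t]; minimize sum(t) subject to  Ax - y <= t, y - Ax <= t.
c = np.concatenate([np.zeros(n), np.ones(m)])
A_ub = np.block([[A, -np.eye(m)], [-A, -np.eye(m)]])
b_ub = np.concatenate([y, -y])
res = linprog(c, A_ub=A_ub, b_ub=b_ub,
              bounds=[(None, None)] * n + [(0, None)] * m)

x_hat = res.x[:n]
print("max recovery error:", np.abs(x_hat - f).max())
```

With m = 4n and 20% of the entries corrupted, well below the empirical 34% threshold quoted above, the recovered x should match f up to solver tolerance.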

This invention has many potential applications, as efficient error-correction codes are in widespread use in many fields of technology. Examples include the transmission of Internet media, Internet telephony, and speech transmission. In addition, error correction is embedded as a key building block in numerous algorithms and data-structures where corruption is possible; corruption of digital data stored on a hard-drive, CD, DVD or similar media is a good example. In short, progress in error correction has potential to impact several storage and communication systems.

While several illustrative embodiments of the invention have been shown and described above, numerous variations and alternative embodiments will occur to experts. Such variations and alternative embodiments are contemplated and can be made without departing from the scope of the invention as defined in the appended claims.

REFERENCES

-   [1] Z. D. Bai and Y. Q. Yin. Limit of the smallest eigenvalue of a large-dimensional sample covariance matrix. Ann. Probab. 21 (1993), 1275-1294.
-   [2] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
-   [3] E. J. Candès, J. Romberg, and T. Tao. Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. To appear, IEEE Transactions on Information Theory. Available at http://arxiv.org/abs/math.NA/0409186
-   [4] E. J. Candès and T. Tao. Near optimal signal recovery from random projections: universal encoding strategies? Submitted to IEEE Transactions on Information Theory, October 2004. Available at http://arxiv.org/abs/math.CA/0410542
-   [5] E. J. Candès and J. Romberg. Quantitative Robust Uncertainty Principles and Optimally Sparse Decompositions. To appear, Foundations of Computational Mathematics, November 2004. Available at http://arxiv.org/abs/math.CA/0411273
-   [6] S. S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM J. Scientific Computing 20 (1999), 33-61.
-   [7] S. S. Chen. Basis Pursuit. Stanford Ph.D. Thesis, 1995.
-   [7] R. Coifman, F. Geshwind, and Y. Meyer. Noiselets. Appl. Comput. Harmon. Anal. 10 (2001), 27-44.
-   [8] K. R. Davidson and S. J. Szarek. Local operator theory, random matrices and Banach spaces. In: Handbook in Banach Spaces, Vol. I, ed. W. B. Johnson and J. Lindenstrauss, Elsevier (2001), 317-366.
-   [9] D. L. Donoho. For Most Large Underdetermined Systems of Linear Equations the Minimal l₁-norm Solution is also the Sparsest Solution. Manuscript, September 2004. Available at http://www-stat.stanford.edu/~donoho/reports.html
-   [10] D. L. Donoho. Compressed Sensing. Manuscript, September 2004. Available at http://www-stat.stanford.edu/~donoho/reports.html
-   [11] D. L. Donoho and X. Huo. Uncertainty principles and ideal atomic decomposition. IEEE Transactions on Information Theory 47 (2001), 2845-2862.
-   [12] D. L. Donoho and M. Elad. Optimally sparse representation in general (nonorthogonal) dictionaries via l₁ minimization. Proc. Natl. Acad. Sci. USA 100 (2003), 2197-2202.
-   [13] M. Elad and A. M. Bruckstein. A generalized uncertainty principle and sparse representation in pairs of R^(N) bases. IEEE Transactions on Information Theory 48 (2002), 2558-2567.
-   [14] N. El Karoui. New Results about Random Covariance Matrices and Statistical Applications. Stanford Ph.D. Thesis, August 2004.
-   [15] S. Geman. A limit theorem for the norm of random matrices. Ann. Probab. 8 (1980), 252-261.
-   [16] R. Gribonval and M. Nielsen. Sparse representations in unions of bases. IEEE Trans. Inform. Theory 49 (2003), 3320-3325.
-   [17] I. M. Johnstone. Large covariance matrices. Third 2004 Wald Lecture, 6th World Congress of the Bernoulli Society and 67th Annual Meeting of the IMS, Barcelona, July 2004.
-   [18] M. Ledoux. The concentration of measure phenomenon. Mathematical Surveys and Monographs 89, American Mathematical Society, Providence, R.I., 2001.
-   [19] V. A. Marchenko and L. A. Pastur. Distribution of eigenvalues in certain sets of random matrices. Mat. Sb. (N.S.) 72 (1967), 407-535 (in Russian).
-   [20] R. J. Muirhead. Aspects of multivariate statistical theory. Wiley Series in Probability and Mathematical Statistics. John Wiley & Sons, Inc., New York, 1982.
-   [21] B. K. Natarajan. Sparse approximate solutions to linear systems. SIAM J. Comput. 24 (1995), 227-234.
-   [22] J. W. Silverstein. The smallest eigenvalue of a large dimensional Wishart matrix. Ann. Probab. 13 (1985), 1364-1368.
-   [23] S. J. Szarek. Condition numbers of random matrices. J. Complexity 7 (1991), 131-149.
-   [24] J. A. Tropp. Greed is good: Algorithmic results for sparse approximation. Technical Report, the University of Texas at Austin, 2003.
-   [25] K. W. Wachter. The limiting empirical measure of multiple discriminant ratios. Ann. Statist. 8 (1980), 937-957.

1.-9. (canceled)
10. A computer system for simulating a physical process, comprising: means for storing in memory a received vector y, said received vector being corrupted by error; means for storing in memory a linear code A used to transmit said received vector; means for storing in memory a solution vector x; means for performing operations on A, x and y, the operations comprising performing $\min\limits_{x \in R^{n}}\left\| y - Ax \right\|_{l_{1}},$ wherein l₁ represents a norm such that $\left\| x \right\|_{l_{1}} := \sum\limits_{i}\left| x_{i} \right|.$
 11. (canceled)
12. The computer system of claim 10, wherein the linear code A is a random matrix.
13. The computer system of claim 10, wherein the linear code A is sampled from a Gaussian ensemble.
14. The computer system of claim 10, wherein the vector includes data relating to a field selected from Internet media, Internet telephony and speech transmission.